Conversation
Signed-off-by: seungrokj <seungrok.jung@amd.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first before we merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you
@seungrokj can u add in ur perf-changelog that VLLM_ROCM_USE_AITER=1 works now due to PR vllm-project/vllm#35641
@seungrokj any updates on this PR?
@claude create a new PR that ports this PR over & uses
Claude finished @functionstackx's task in 2m 59s: Porting PR #909 with
Port changes from PR #909 with the resolved upstream docker image:
- Upgrade image from v0.15.1 to v0.18.0
- Enable AITER MLA (VLLM_ROCM_USE_AITER=1)
- Add --max-num-seqs 256, remove --disable-log-requests
- Update perf-changelog

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
Co-authored-by: seungrokj <seungrokj@users.noreply.github.com>
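For readers outside the thread, a minimal sketch of how those changes might compose into a launch command. Only the v0.18.0 image tag, VLLM_ROCM_USE_AITER=1, and --max-num-seqs 256 come from this PR; the docker invocation shape, device flags, model placeholder, and tensor-parallel size are assumptions for illustration, not the recipe actually merged.

```shell
# Hypothetical launch sketch, NOT the exact recipe from this repo.
# From the thread: image tag v0.18.0, VLLM_ROCM_USE_AITER=1, --max-num-seqs 256.
# Assumed for illustration: device flags, <model> placeholder, tp size 8.
docker run --rm --ipc=host --network=host \
  --device /dev/kfd --device /dev/dri \
  --group-add video \
  -e VLLM_ROCM_USE_AITER=1 \
  vllm-openai-rocm:v0.18.0 \
  vllm serve <model> \
    --tensor-parallel-size 8 \
    --max-num-seqs 256
```

Note that --disable-log-requests is absent here on purpose, matching the change list above.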
hi @functionstackx I tested a few things to figure out the non-deterministic behavior of the kimik2 fp4 case, and it's now resolved. So I'll now work on the "int4" model, or I'll update on top of Mr. Klaud's work in #950
@seungrokj feel free to continue working on this! appreciate ur help!
i started a claude code PR cuz i thought u were busy with other tasks and didnt wanna delay adding kimi k2.5 int4 mi355 AITER, which is an easy win cuz it has great perf improvements
Sure, will work on this today!
hi @seungrokj since #909 already passed validation & the improvements are an easy win & we ideally want to show these improvements on the frontend ASAP, i am gonna merge #909. for any additional changes, can u build on top of #909
…which has the AITER MLA patch for num_heads=8 (#950)

* [AMD/ROCm] kimik2.5 int4 mi355x: upgrade to vllm-openai-rocm:v0.18.0

  Port changes from PR #909 with the resolved upstream docker image:
  - Upgrade image from v0.15.1 to v0.18.0
  - Enable AITER MLA (VLLM_ROCM_USE_AITER=1)
  - Add --max-num-seqs 256, remove --disable-log-requests
  - Update perf-changelog

  Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
  Co-authored-by: seungrokj <seungrokj@users.noreply.github.com>

* Update perf-changelog PR link to #950

  Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
Co-authored-by: seungrokj <seungrokj@users.noreply.github.com>
@functionstackx yes, if we are just using tp8, then #909 should be solid. It also aligned with internal measurement on
@seungrokj do u see anything better with TP4? if TP4 is on the pareto frontier, feel free to add it
@functionstackx based on the fp4 case, int4 (same memory footprint) could have better tput/gpu. Testing internally, and if it looks good I will raise a subsequent PR.
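The tput/gpu comparison behind the TP4-vs-TP8 question can be sketched with simple arithmetic. This is purely illustrative: both token rates below are invented placeholders, not measurements from this PR or any internal run.

```shell
# Illustrative only: whether TP4 beats TP8 comes down to tokens/s per GPU.
# Both totals below are made-up placeholders, not real measurements.
tp8_total=10000   # total tokens/s for one hypothetical TP8 replica
tp4_total=6000    # total tokens/s for one hypothetical TP4 replica

echo "TP8 per GPU: $((tp8_total / 8)) tokens/s"   # prints 1250
echo "TP4 per GPU: $((tp4_total / 4)) tokens/s"   # prints 1500
```

With these placeholder numbers TP4 would win on tokens/s per GPU (and two TP4 replicas fit on the same 8 GPUs), which is exactly the pareto-frontier question the follow-up PR is meant to answer with real data.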
thanks! looking forward to ur follow-up PR on whether tp4 is better or not

Waiting for the optimized upstream docker image.
Regards,
Seungrok